While Home Equity Lines of Credit (HELOCs) can be a useful tool for homeowners, they present significant risks for retail banks. Proper risk management and compliance with regulatory requirements are essential for banks to minimize their exposure to loan defaults. The project proposes a tuned Decision Tree classifier model for the prediction of loan default. Although the XGBoost Classifier model offered the best overall metrics, the tuned Decision Tree classifier was preferred because it provided the following two advantages:
Key insights from the analysis of 5,960 recent home equity loans are as follows:
The model can be used when screening loan applicants (provided the required input variables are available) to decide whether or not to approve the loan.
A major proportion of retail bank profit comes from interest on home loans, which are typically taken out by customers with regular or high incomes. Banks are most fearful of defaulters, as bad loans (non-performing assets, or NPAs) usually eat up a major chunk of their profits. It is therefore important for banks to be judicious when approving loans for their customer base. The approval process is multifaceted: the bank tries to check the creditworthiness of the applicant through a manual study of various aspects of the application. This process is not only effort-intensive but also prone to wrong judgment owing to human error and bias. Many banks have attempted to automate the process using heuristics, but with the advent of data science and machine learning, the focus has shifted to building machines that can learn the approval process and make it more efficient and free of bias. At the same time, it is important to ensure that the machine does not learn the biases that previously crept in through the human approval process.
The objective is to build a classification model to predict clients who are likely to default on their loan and give recommendations to the bank on the important features to consider while approving a loan.
We want to help the bank avoid two types of errors:
The latter scenario is the costliest to the bank, so we want to build a model that minimizes false negatives, that is, predictions that a borrower will not default when the borrower actually defaults.
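To make this concrete, here is a minimal sketch with made-up labels showing how missed defaulters (false negatives) drive down recall on the default class:

```python
from sklearn.metrics import recall_score

# Hypothetical labels for six applicants (1 = default, 0 = repaid)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]  # two defaulters predicted as safe -> two false negatives

# Recall on the default class = TP / (TP + FN) = 1 / (1 + 2)
print(recall_score(y_true, y_pred))  # 0.3333...
```

Only one of the three actual defaulters is caught, so recall on the default class is 1/3; this is the number the model should push as high as possible.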
The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable indicating whether an applicant ultimately defaulted or was severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). For each applicant, 12 input variables were recorded.
BAD: 1 = Client defaulted on loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).
JOB: The type of job that loan applicant has such as manager, self, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income; this is one way lenders measure an applicant's ability to manage the monthly payments on the money they plan to borrow).
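As an illustration of the DEBTINC formula, a quick sketch with hypothetical figures:

```python
# Hypothetical applicant used only to illustrate the debt-to-income calculation
monthly_debt_payments = 1500.0  # mortgage + car + credit cards, in dollars
gross_monthly_income = 5000.0

debt_to_income = monthly_debt_payments / gross_monthly_income * 100
print(f"{debt_to_income:.1f}%")  # 30.0%
```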
pip install nb_black
pip install xgboost
# Library to suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Use code formatting
%load_ext nb_black
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
import random
# Libraries to help with data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Libraries to scale and split the data, as well as tune the model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
# Libraries to access the machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
RandomForestClassifier,
GradientBoostingClassifier,
AdaBoostClassifier,
)
from xgboost import XGBClassifier
# Libraries to access metrics to evaluate the model
from sklearn.metrics import (
confusion_matrix,
classification_report,
precision_recall_curve,
make_scorer,
recall_score,
precision_score,
accuracy_score,
)
data = pd.read_csv("hmeq.csv")
# Copying data to another variable to avoid any changes to the original data
df = data.copy()
df.head()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
df.tail()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5955 | 0 | 88900 | 57264.0 | 90185.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 221.808718 | 0.0 | 16.0 | 36.112347 |
| 5956 | 0 | 89000 | 54576.0 | 92937.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 208.692070 | 0.0 | 15.0 | 35.859971 |
| 5957 | 0 | 89200 | 54045.0 | 92924.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 212.279697 | 0.0 | 15.0 | 35.556590 |
| 5958 | 0 | 89800 | 50370.0 | 91861.0 | DebtCon | Other | 14.0 | 0.0 | 0.0 | 213.892709 | 0.0 | 16.0 | 34.340882 |
| 5959 | 0 | 89900 | 48811.0 | 88934.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 219.601002 | 0.0 | 16.0 | 34.571519 |
Observations:
df.shape
(5960, 13)
Observation:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5960 entries, 0 to 5959 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 BAD 5960 non-null int64 1 LOAN 5960 non-null int64 2 MORTDUE 5442 non-null float64 3 VALUE 5848 non-null float64 4 REASON 5708 non-null object 5 JOB 5681 non-null object 6 YOJ 5445 non-null float64 7 DEROG 5252 non-null float64 8 DELINQ 5380 non-null float64 9 CLAGE 5652 non-null float64 10 NINQ 5450 non-null float64 11 CLNO 5738 non-null float64 12 DEBTINC 4693 non-null float64 dtypes: float64(9), int64(2), object(2) memory usage: 605.4+ KB
# Checking the number of unique values in each column
df.nunique()
BAD 2 LOAN 540 MORTDUE 5053 VALUE 5381 REASON 2 JOB 6 YOJ 99 DEROG 11 DELINQ 14 CLAGE 5314 NINQ 16 CLNO 62 DEBTINC 4693 dtype: int64
# Checking the categories for the REASON variable
df["REASON"].unique()
array(['HomeImp', nan, 'DebtCon'], dtype=object)
# Checking the categories for the JOB variable
df["JOB"].unique()
array(['Other', nan, 'Office', 'Sales', 'Mgr', 'ProfExe', 'Self'],
dtype=object)
Observations:
# Function for checking for missing values
def check_missing():
# Calculate the total number of null values per column
total_missing = df.isnull().sum().sort_values(ascending=False)
# Calculate the percentage of values that are null per column
percent_missing = (df.isnull().sum() / df.isnull().count()).sort_values(
ascending=False
)
# Format results into a data frame
missing_data = pd.concat(
[total_missing, round(percent_missing, 2)],
axis=1,
keys=["Total Missing", "Percent Missing"],
)
return missing_data
check_missing()
| Total Missing | Percent Missing | |
|---|---|---|
| DEBTINC | 1267 | 0.21 |
| DEROG | 708 | 0.12 |
| DELINQ | 580 | 0.10 |
| MORTDUE | 518 | 0.09 |
| YOJ | 515 | 0.09 |
| NINQ | 510 | 0.09 |
| CLAGE | 308 | 0.05 |
| JOB | 279 | 0.05 |
| REASON | 252 | 0.04 |
| CLNO | 222 | 0.04 |
| VALUE | 112 | 0.02 |
| BAD | 0 | 0.00 |
| LOAN | 0 | 0.00 |
Observations:
df[df.duplicated()]
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Observation:
# Creating numerical columns
num_cols = [
"LOAN",
"MORTDUE",
"VALUE",
"YOJ",
"DEROG",
"DELINQ",
"CLAGE",
"NINQ",
"CLNO",
"DEBTINC",
]
# Creating categorical columns
cat_cols = ["BAD", "REASON", "JOB"]
# Checking summary statistics
df[num_cols].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| LOAN | 5960.0 | 18607.969799 | 11207.480417 | 1100.000000 | 11100.000000 | 16300.000000 | 23300.000000 | 89900.000000 |
| MORTDUE | 5442.0 | 73760.817200 | 44457.609458 | 2063.000000 | 46276.000000 | 65019.000000 | 91488.000000 | 399550.000000 |
| VALUE | 5848.0 | 101776.048741 | 57385.775334 | 8000.000000 | 66075.500000 | 89235.500000 | 119824.250000 | 855909.000000 |
| YOJ | 5445.0 | 8.922268 | 7.573982 | 0.000000 | 3.000000 | 7.000000 | 13.000000 | 41.000000 |
| DEROG | 5252.0 | 0.254570 | 0.846047 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 |
| DELINQ | 5380.0 | 0.449442 | 1.127266 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 15.000000 |
| CLAGE | 5652.0 | 179.766275 | 85.810092 | 0.000000 | 115.116702 | 173.466667 | 231.562278 | 1168.233561 |
| NINQ | 5450.0 | 1.186055 | 1.728675 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 17.000000 |
| CLNO | 5738.0 | 21.296096 | 10.138933 | 0.000000 | 15.000000 | 20.000000 | 26.000000 | 71.000000 |
| DEBTINC | 4693.0 | 33.779915 | 8.601746 | 0.524499 | 29.140031 | 34.818262 | 39.003141 | 203.312149 |
Observations:
Leading Questions:
# Defining a hist_box() function that provides both a boxplot and a histogram in the same visual
def hist_box(col):
f, (ax_box, ax_hist) = plt.subplots(
2, sharex=True, gridspec_kw={"height_ratios": (0.15, 0.85)}, figsize=(15, 10)
)
sns.set(style="darkgrid")
# Adding a graph in each part
sns.boxplot(df[col], ax=ax_box, orient="h", showmeans=True)
sns.distplot(df[col], ax=ax_hist)
ax_hist.axvline(
df[col].mean(), color="green", linestyle="--"
) # Green line corresponds to the mean in the plot
ax_hist.axvline(
df[col].median(), color="orange", linestyle="-"
) # Orange line corresponds to the median in the plot
plt.show()
# Amount of loan approved
hist_box("LOAN")
# Amount due on the existing mortgage
hist_box("MORTDUE")
# Current value of the property
hist_box("VALUE")
# Years at present job
hist_box("YOJ")
# Number of major derogatory reports
hist_box("DEROG")
# Number of delinquent credit lines
hist_box("DELINQ")
# Age of the oldest credit line in months
hist_box("CLAGE")
# Number of recent credit inquiries
hist_box("NINQ")
# Number of existing credit lines
hist_box("CLNO")
# Debt-to-income ratio
hist_box("DEBTINC")
Observations:
# Univariate analysis for categorical variables
for cat in cat_cols:
print(df[cat].value_counts(normalize=True))
print("*" * 40)
0 0.800503 1 0.199497 Name: BAD, dtype: float64 **************************************** DebtCon 0.688157 HomeImp 0.311843 Name: REASON, dtype: float64 **************************************** Other 0.420349 ProfExe 0.224608 Office 0.166872 Mgr 0.135011 Self 0.033973 Sales 0.019187 Name: JOB, dtype: float64 ****************************************
Observations:
# Checking how default rate is related with our categorical variables
cat_cols.remove("BAD")
for cat in cat_cols:
(pd.crosstab(df[cat], df["BAD"], normalize="index") * 100).plot(
kind="bar", figsize=(8, 4), stacked=True
)
plt.ylabel("Percentage Default %")
Observations:
# Checking the mean of numerical variables grouped by default
df.groupby(["BAD"])[num_cols].mean().T
| BAD | 0 | 1 |
|---|---|---|
| LOAN | 19028.107315 | 16922.119428 |
| MORTDUE | 74829.249055 | 69460.452973 |
| VALUE | 102595.921018 | 98172.846227 |
| YOJ | 9.154941 | 8.027802 |
| DEROG | 0.134217 | 0.707804 |
| DELINQ | 0.245133 | 1.229185 |
| CLAGE | 187.002355 | 150.190183 |
| NINQ | 1.032749 | 1.782765 |
| CLNO | 21.317036 | 21.211268 |
| DEBTINC | 33.253129 | 39.387645 |
Observations:
# Visualizing bivariate relations between 'BAD' and numerical variables
plt.figure(figsize=(10, 15))
for i, num_col in enumerate(num_cols):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df, x=df["BAD"], y=df[num_col])
plt.tight_layout()
plt.title(num_col)
Observations:
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="DEBTINC", x="JOB", hue="BAD")
plt.show()
sns.displot(data=df, x="DEBTINC", bins=50, col="JOB", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="DEBTINC", x="REASON", hue="BAD")
plt.show()
sns.displot(data=df, x="DEBTINC", bins=50, col="REASON", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="LOAN", x="JOB", hue="BAD")
plt.show()
sns.displot(data=df, x="LOAN", bins=50, col="JOB", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="LOAN", x="REASON", hue="BAD")
plt.show()
sns.displot(data=df, x="LOAN", bins=50, col="REASON", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="MORTDUE", x="JOB", hue="BAD")
plt.show()
sns.displot(data=df, x="MORTDUE", bins=50, col="JOB", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="MORTDUE", x="REASON", hue="BAD")
plt.show()
sns.displot(data=df, x="MORTDUE", bins=50, col="REASON", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="VALUE", x="JOB", hue="BAD")
plt.show()
sns.displot(data=df, x="VALUE", bins=50, col="JOB", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="VALUE", x="REASON", hue="BAD")
plt.show()
sns.displot(data=df, x="VALUE", bins=50, col="REASON", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="YOJ", x="JOB", hue="BAD")
plt.show()
sns.displot(data=df, x="YOJ", bins=50, col="JOB", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="YOJ", x="REASON", hue="BAD")
plt.show()
sns.displot(data=df, x="YOJ", bins=50, col="REASON", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="DEROG", x="JOB", hue="BAD")
plt.show()
sns.displot(data=df, x="DEROG", bins=50, col="JOB", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="DEROG", x="REASON", hue="BAD")
plt.show()
sns.displot(data=df, x="DEROG", bins=50, col="REASON", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="DELINQ", x="JOB", hue="BAD")
plt.show()
sns.displot(data=df, x="DELINQ", bins=50, col="JOB", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="DELINQ", x="REASON", hue="BAD")
plt.show()
sns.displot(data=df, x="DELINQ", bins=50, col="REASON", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="CLAGE", x="JOB", hue="BAD")
plt.show()
sns.displot(data=df, x="CLAGE", bins=50, col="JOB", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="CLAGE", x="REASON", hue="BAD")
plt.show()
sns.displot(data=df, x="CLAGE", bins=50, col="REASON", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="NINQ", x="JOB", hue="BAD")
plt.show()
sns.displot(data=df, x="NINQ", bins=50, col="JOB", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="NINQ", x="REASON", hue="BAD")
plt.show()
sns.displot(data=df, x="NINQ", bins=50, col="REASON", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="CLNO", x="JOB", hue="BAD")
plt.show()
sns.displot(data=df, x="CLNO", bins=50, col="JOB", row="BAD")
plt.figure(figsize=(10, 10), dpi=200)
sns.boxplot(data=df, y="CLNO", x="REASON", hue="BAD")
plt.show()
sns.displot(data=df, x="CLNO", bins=50, col="REASON", row="BAD")
# Visualizing pair plots
sns.pairplot(df, hue="BAD")
# Checking pairwise correlations
plt.figure(figsize=(20, 10))
sns.heatmap(df.corr(), annot=True, fmt=".2f")
plt.show()
Observations:
corr_bad = df.corr()["BAD"].sort_values().iloc[1:-1]
plt.figure(figsize=(10, 4), dpi=200)
sns.barplot(x=corr_bad.index, y=corr_bad.values)
plt.title("Correlation with BAD")
plt.xticks(rotation=90)
plt.figure(figsize=(10, 4), dpi=200)
sns.scatterplot(data=df, x="DELINQ", y="DEBTINC", hue="BAD", alpha=0.5, linewidth=0.3)
plt.figure(figsize=(10, 4), dpi=200)
sns.scatterplot(data=df, x="DELINQ", y="CLNO", hue="BAD", alpha=0.5, linewidth=0.3)
Outliers should be carefully investigated, as it is not acceptable to drop an observation just because it is an outlier. However, we can try and check whether an outlier is due to incorrectly entered or measured data.
# Make copy of df
df_outliers = df.copy()
# Function to calculate the boxplot whiskers for a particular column
def calculate_whiskers(col):
# calculate IQR
q1 = df_outliers[col].quantile(0.25)
q3 = df_outliers[col].quantile(0.75)
iqr = q3 - q1
# find higher and lower fence
higher_whisker = q3 + (1.5 * iqr)
lower_whisker = q1 - (1.5 * iqr)
return lower_whisker, higher_whisker
# Function to create a flag variable for outliers for a particular column
def flag_outliers(col):
lower_fence = calculate_whiskers(col)[0]
higher_fence = calculate_whiskers(col)[1]
df_outliers[col + "_outlier"] = df_outliers[col].apply(
lambda x: 1 if x < lower_fence or x > higher_fence else 0
)
# Creating flag variables for all numerical columns
for num_col in num_cols:
flag_outliers(num_col)
df_outliers
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | ... | LOAN_outlier | MORTDUE_outlier | VALUE_outlier | YOJ_outlier | DEROG_outlier | DELINQ_outlier | CLAGE_outlier | NINQ_outlier | CLNO_outlier | DEBTINC_outlier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5955 | 0 | 88900 | 57264.0 | 90185.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 221.808718 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5956 | 0 | 89000 | 54576.0 | 92937.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 208.692070 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5957 | 0 | 89200 | 54045.0 | 92924.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 212.279697 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5958 | 0 | 89800 | 50370.0 | 91861.0 | DebtCon | Other | 14.0 | 0.0 | 0.0 | 213.892709 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5959 | 0 | 89900 | 48811.0 | 88934.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 219.601002 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5960 rows × 23 columns
# Comparing summary statistics for outliers vs non-outliers
for num_col in num_cols:
print(df_outliers.groupby(num_col + "_outlier")[num_col].describe().T)
LOAN_outlier 0 1 count 5704.000000 256.000000 mean 16995.038569 54546.093750 std 7965.198387 12710.464439 min 1100.000000 41700.000000 25% 10900.000000 45000.000000 50% 15900.000000 50350.000000 75% 22300.000000 59850.000000 max 41600.000000 89900.000000 MORTDUE_outlier 0 1 count 5208.000000 234.000000 mean 67623.862942 210347.388889 std 32936.478948 48309.837298 min 2063.000000 159500.000000 25% 45498.250000 179798.000000 50% 63508.000000 197674.000000 75% 87000.000000 231427.250000 max 159000.000000 399550.000000 VALUE_outlier 0 1 count 5528.000000 320.000000 mean 92638.820738 259621.662500 std 38895.830184 87794.165293 min 8000.000000 200459.000000 25% 65000.000000 210241.500000 50% 86908.000000 237424.000000 75% 114220.250000 282984.500000 max 200339.000000 855909.000000 YOJ_outlier 0 1 count 5354.000000 91.000000 mean 8.548067 30.938462 std 7.059202 2.759057 min 0.000000 28.500000 25% 3.000000 29.000000 50% 7.000000 30.000000 75% 13.000000 31.000000 max 28.000000 41.000000 DEROG_outlier 0 1 count 4527.0 725.000000 mean 0.0 1.844138 std 0.0 1.502019 min 0.0 1.000000 25% 0.0 1.000000 50% 0.0 1.000000 75% 0.0 2.000000 max 0.0 10.000000 DELINQ_outlier 0 1 count 4179.0 1201.000000 mean 0.0 2.013322 std 0.0 1.595250 min 0.0 1.000000 25% 0.0 1.000000 50% 0.0 1.000000 75% 0.0 2.000000 max 0.0 15.000000 CLAGE_outlier 0 1 count 5605.000000 47.000000 mean 176.727344 542.175030 std 78.075515 163.180436 min 0.000000 407.261167 25% 114.759019 420.730518 50% 172.432355 475.800000 75% 229.406175 627.930226 max 405.867848 1168.233561 NINQ_outlier 0 1 count 5273.000000 177.000000 mean 0.962261 7.853107 std 1.190782 1.960086 min 0.000000 6.000000 25% 0.000000 6.000000 50% 1.000000 7.000000 75% 2.000000 9.000000 max 5.000000 17.000000 CLNO_outlier 0 1 count 5519.000000 219.000000 mean 20.181011 49.397260 std 8.558459 5.145782 min 0.000000 43.000000 25% 14.000000 46.000000 50% 20.000000 49.000000 75% 26.000000 51.000000 max 42.000000 71.000000 DEBTINC_outlier 0 1 count 4599.000000 
94.000000 mean 33.776951 33.924942 std 6.638178 39.424147 min 14.370986 0.524499 25% 29.304810 4.349208 50% 34.880462 13.120850 75% 38.974704 62.055700 max 53.584883 203.312149
# Visualizing bivariate relations between 'BAD' and outliers in numerical variables
def viz_outliers(col):
(
pd.crosstab(
df_outliers[col + "_outlier"], df_outliers["BAD"], normalize="index"
)
* 100
).plot(kind="bar", figsize=(8, 4), stacked=True)
plt.ylabel("Percentage Default %")
viz_outliers("LOAN")
viz_outliers("MORTDUE")
viz_outliers("VALUE")
viz_outliers("YOJ")
viz_outliers("DEROG")
viz_outliers("DELINQ")
viz_outliers("CLAGE")
viz_outliers("NINQ")
viz_outliers("CLNO")
viz_outliers("DEBTINC")
Given the distortion of the mean that outliers introduce, the median is used to fill missing values in the numerical columns.
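The median's robustness can be checked on a small skewed sample (illustrative values, not drawn from the dataset):

```python
import numpy as np

# A small skewed sample: one extreme value drags the mean far above
# the bulk of the data, while the median stays representative
debtinc_sample = np.array([28.0, 31.0, 33.0, 35.0, 38.0, 203.0])

print(debtinc_sample.mean())      # ~61.3, pulled up by the single outlier
print(np.median(debtinc_sample))  # 34.0, unaffected
```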
# Filling missing values in numerical columns with the median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
The mode is used to fill missing values in categorical columns.
# Filling missing values in categorical columns with the mode
df["REASON"] = df["REASON"].fillna(df["REASON"].mode()[0])
df["JOB"] = df["JOB"].fillna(df["JOB"].mode()[0])
# Checking for missing values
check_missing()
| Total Missing | Percent Missing | |
|---|---|---|
| BAD | 0 | 0.0 |
| LOAN | 0 | 0.0 |
| MORTDUE | 0 | 0.0 |
| VALUE | 0 | 0.0 |
| REASON | 0 | 0.0 |
| JOB | 0 | 0.0 |
| YOJ | 0 | 0.0 |
| DEROG | 0 | 0.0 |
| DELINQ | 0 | 0.0 |
| CLAGE | 0 | 0.0 |
| NINQ | 0 | 0.0 |
| CLNO | 0 | 0.0 |
| DEBTINC | 0 | 0.0 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5960 entries, 0 to 5959 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 BAD 5960 non-null int64 1 LOAN 5960 non-null int64 2 MORTDUE 5960 non-null float64 3 VALUE 5960 non-null float64 4 REASON 5960 non-null object 5 JOB 5960 non-null object 6 YOJ 5960 non-null float64 7 DEROG 5960 non-null float64 8 DELINQ 5960 non-null float64 9 CLAGE 5960 non-null float64 10 NINQ 5960 non-null float64 11 CLNO 5960 non-null float64 12 DEBTINC 5960 non-null float64 dtypes: float64(9), int64(2), object(2) memory usage: 605.4+ KB
# Creating dummy variables for the categorical variables
df = pd.get_dummies(data=df, columns=["REASON", "JOB"], drop_first=True)
df.head()
| BAD | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | REASON_HomeImp | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | 34.818262 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | 34.818262 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | 34.818262 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1500 | 65019.0 | 89235.5 | 7.0 | 0.0 | 0.0 | 173.466667 | 1.0 | 20.0 | 34.818262 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | 34.818262 | 1 | 1 | 0 | 0 | 0 | 0 |
# Separating the target variable and other variables
y = df["BAD"]
X = df.drop(columns=["BAD"])
# Scaling the data
# Note: the scaler is fit on the full dataset here; a stricter pipeline would
# fit it on the training split only, to avoid leaking test-set statistics
X_scaled = StandardScaler().fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | REASON_HomeImp | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.562299 | -1.107920 | -1.099371 | 0.240237 | -0.280976 | -0.375943 | -1.018026 | -0.102879 | -1.230878 | 0.106963 | 1.532421 | -0.434909 | 1.111180 | -0.521936 | -0.136489 | -0.182938 |
| 1 | -1.544453 | -0.069286 | -0.582794 | -0.241936 | -0.280976 | 1.477341 | -0.689350 | -0.707574 | -0.728389 | 0.106963 | 1.532421 | -0.434909 | 1.111180 | -0.521936 | -0.136489 | -0.182938 |
| 2 | -1.526606 | -1.398407 | -1.491970 | -0.655226 | -0.280976 | -0.375943 | -0.358680 | -0.102879 | -1.130380 | 0.106963 | 1.532421 | -0.434909 | 1.111180 | -0.521936 | -0.136489 | -0.182938 |
| 3 | -1.526606 | -0.187596 | -0.216389 | -0.241936 | -0.280976 | -0.375943 | -0.071488 | -0.102879 | -0.125403 | 0.106963 | -0.652562 | -0.434909 | 1.111180 | -0.521936 | -0.136489 | -0.182938 |
| 4 | -1.508759 | 0.582831 | 0.183939 | -0.792990 | -0.280976 | -0.375943 | -1.030391 | -0.707574 | -0.728389 | 0.106963 | 1.532421 | 2.299330 | -0.899944 | -0.521936 | -0.136489 | -0.182938 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5955 | 6.272411 | -0.369856 | -0.199691 | 0.997936 | -0.280976 | -0.375943 | 0.506990 | -0.707574 | -0.527394 | 0.276259 | -0.652562 | -0.434909 | 1.111180 | -0.521936 | -0.136489 | -0.182938 |
| 5956 | 6.281335 | -0.433030 | -0.151296 | 0.997936 | -0.280976 | -0.375943 | 0.350032 | -0.707574 | -0.627892 | 0.243243 | -0.652562 | -0.434909 | 1.111180 | -0.521936 | -0.136489 | -0.182938 |
| 5957 | 6.299181 | -0.445509 | -0.151524 | 0.860173 | -0.280976 | -0.375943 | 0.392963 | -0.707574 | -0.627892 | 0.203553 | -0.652562 | -0.434909 | 1.111180 | -0.521936 | -0.136489 | -0.182938 |
| 5958 | 6.352722 | -0.531880 | -0.170218 | 0.722409 | -0.280976 | -0.375943 | 0.412264 | -0.707574 | -0.527394 | 0.044510 | -0.652562 | -0.434909 | 1.111180 | -0.521936 | -0.136489 | -0.182938 |
| 5959 | 6.361645 | -0.568520 | -0.221691 | 0.860173 | -0.280976 | -0.375943 | 0.480572 | -0.707574 | -0.527394 | 0.074683 | -0.652562 | -0.434909 | 1.111180 | -0.521936 | -0.136489 | -0.182938 |
5960 rows × 16 columns
# Splitting the data into 70% train and 30% test sets
# (identical random_state and stratify keep y_train/y_test consistent across both splits)
# scaled X
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.3, random_state=1, stratify=y
)
# not scaled X
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=1, stratify=y
)
# Function to calculate and print the classification report and the confusion matrix
def metrics_score(actual, predicted):
print(classification_report(actual, predicted))
cm = confusion_matrix(actual, predicted)
plt.figure(figsize=(8, 5))
sns.heatmap(
cm,
annot=True,
fmt=".2f",
xticklabels=["Not Default", "Default"],
yticklabels=["Not Default", "Default"],
)
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()
# Function to compute different metrics to check classification model performance
def model_performance_classification(model, predictors, target):
# Predicting using the independent variables
pred = model.predict(predictors)
recall = recall_score(target, pred, average="macro") # To compute recall
precision = precision_score(target, pred, average="macro") # To compute precision
acc = accuracy_score(target, pred) # To compute accuracy score
# Creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Precision": precision,
"Recall": recall,
"Accuracy": acc,
},
index=[0],
)
return df_perf
# Fitting the logistic regression model
lg = LogisticRegression()
lg.fit(X_train_scaled, y_train)
LogisticRegression()
# Checking the performance on the training data
y_pred_train = lg.predict(X_train_scaled)
metrics_score(y_train, y_pred_train)
precision recall f1-score support
0 0.85 0.97 0.91 3340
1 0.72 0.34 0.46 832
accuracy 0.84 4172
macro avg 0.79 0.65 0.68 4172
weighted avg 0.83 0.84 0.82 4172
# Checking the performance on the test dataset
y_pred_test = lg.predict(X_test_scaled)
metrics_score(y_test, y_pred_test)
precision recall f1-score support
0 0.84 0.97 0.90 1431
1 0.70 0.28 0.40 357
accuracy 0.83 1788
macro avg 0.77 0.63 0.65 1788
weighted avg 0.81 0.83 0.80 1788
Observations:
# Printing the coefficients of logistic regression
cols = X.columns
coef_lg = lg.coef_
pd.DataFrame(coef_lg, columns=cols).T.sort_values(by=0, ascending=False)
| 0 | |
|---|---|
| DELINQ | 0.853804 |
| DEBTINC | 0.537468 |
| DEROG | 0.485804 |
| NINQ | 0.240948 |
| VALUE | 0.219583 |
| REASON_HomeImp | 0.124135 |
| JOB_Sales | 0.098761 |
| JOB_Self | 0.090914 |
| JOB_ProfExe | -0.028720 |
| JOB_Other | -0.035297 |
| YOJ | -0.108197 |
| CLNO | -0.139264 |
| JOB_Office | -0.198961 |
| LOAN | -0.227761 |
| MORTDUE | -0.229629 |
| CLAGE | -0.455237 |
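Because the features were standardized, exponentiating a coefficient gives the multiplicative change in the odds of default per one-standard-deviation increase in that feature. A sketch using the DELINQ coefficient from the table above:

```python
import numpy as np

# Coefficient for DELINQ from the fitted logistic regression above
coef_delinq = 0.853804

# exp(coef) = multiplicative change in the odds of default
# per one-standard-deviation increase in DELINQ
odds_ratio = np.exp(coef_delinq)
print(round(odds_ratio, 2))  # ~2.35
```

So, all else equal, a one-standard-deviation increase in delinquent credit lines roughly multiplies the odds of default by 2.35.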
y_scores_lg = lg.predict_proba(
X_train_scaled
) # predict_proba gives the probability of each observation belonging to each class
precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(
y_train, y_scores_lg[:, 1]
)
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10, 7))
plt.plot(thresholds_lg, precisions_lg[:-1], "b--", label="precision")
plt.plot(thresholds_lg, recalls_lg[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.show()
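The threshold used below was read off this plot; the same balance point can also be located programmatically. A sketch with made-up scores standing in for `lg.predict_proba(...)[:, 1]`:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted probabilities of default
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# Threshold where precision and recall are closest to each other
# (precisions/recalls have one extra trailing element, hence the [:-1])
idx = np.argmin(np.abs(precisions[:-1] - recalls[:-1]))
print(thresholds[idx])
```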
Observation:
Let's find out the performance of the model at this threshold.
optimal_threshold1 = 0.28
y_pred_train = lg.predict_proba(X_train_scaled)
metrics_score(y_train, y_pred_train[:, 1] > optimal_threshold1)
precision recall f1-score support
0 0.89 0.89 0.89 3340
1 0.55 0.56 0.55 832
accuracy 0.82 4172
macro avg 0.72 0.72 0.72 4172
weighted avg 0.82 0.82 0.82 4172
Observation:
Let's check the performance on the test data.
y_pred_test = lg.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test[:, 1] > optimal_threshold1)
precision recall f1-score support
0 0.88 0.90 0.89 1431
1 0.55 0.50 0.53 357
accuracy 0.82 1788
macro avg 0.71 0.70 0.71 1788
weighted avg 0.81 0.82 0.82 1788
lg_test = model_performance_classification(lg, X_test_scaled, y_test)
lg_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.771535 | 0.625032 | 0.832215 |
Observations:
- model_performance_classification uses the default 0.5 threshold, so macro recall on the test set is only about 0.63.
- A linear model may be missing non-linear relationships in the data; tree-based models are worth trying next.
# Building decision tree model
dt = DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
# Fitting decision tree model
dt.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
# Checking performance on the training dataset
y_pred_train_dt = dt.predict(X_train)
metrics_score(y_train, y_pred_train_dt)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
# Checking the performance on the test dataset
y_pred_test_dt = dt.predict(X_test)
metrics_score(y_test, y_pred_test_dt)
precision recall f1-score support
0 0.91 0.94 0.92 1431
1 0.70 0.61 0.65 357
accuracy 0.87 1788
macro avg 0.80 0.77 0.79 1788
weighted avg 0.87 0.87 0.87 1788
dtree_test = model_performance_classification(dt, X_test, y_test)
dtree_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.804214 | 0.774228 | 0.870805 |
Observation:
- The decision tree fits the training data perfectly (all metrics 1.00) but drops to 0.87 accuracy and 0.61 recall for class 1 on the test set, a clear sign of overfitting.
# Plotting feature importances
importances = dt.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(
importances, index=columns, columns=["Importance"]
).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(13, 5))
sns.barplot(data=importance_df, x=importance_df.Importance, y=importance_df.index)
Observations:
Let's explore if we can improve the performance of this decision tree model by tuning some of the parameters.
Criterion {“gini”, “entropy”}
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
max_depth
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_leaf
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
You can learn about more hyperparameters at the link below and try tuning them.
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
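The effect of these hyperparameters is easy to see empirically: sweeping `max_depth`, for instance, shows the gap between training and test accuracy shrinking as the tree is restricted. A minimal sketch on synthetic data (the actual search below uses `GridSearchCV` on `X_train`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for the HELOC features
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=1)

# Shallow trees underfit slightly but keep train and test scores close;
# an unrestricted tree (max_depth=None) memorizes the training set
for depth in [2, 4, 6, None]:
    tree_clf = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(Xtr, ytr)
    print(f"max_depth={depth}: train={tree_clf.score(Xtr, ytr):.2f}, "
          f"test={tree_clf.score(Xte, yte):.2f}")
```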
# Choosing the type of classifier
dtree_estimator = DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7),
"criterion": ["gini", "entropy"],
"min_samples_leaf": np.arange(5, 55, 5),
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score, pos_label=1)
# Running the grid search
gridCV = GridSearchCV(dtree_estimator, parameters, scoring=scorer, cv=10)
# Fitting the grid search on the train data
gridCV = gridCV.fit(X_train, y_train)
# Setting the classifier to the best combination of parameters
dtree_estimator = gridCV.best_estimator_
# Fitting the best estimator to the data
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, max_depth=6,
min_samples_leaf=40, random_state=1)
# Checking performance on the training dataset
y_train_pred_dt = dtree_estimator.predict(X_train)
metrics_score(y_train, y_train_pred_dt)
precision recall f1-score support
0 0.96 0.84 0.90 3340
1 0.57 0.86 0.69 832
accuracy 0.84 4172
macro avg 0.77 0.85 0.79 4172
weighted avg 0.88 0.84 0.85 4172
Observation:
- Pruning via the grid search removes the overfitting: train accuracy drops to 0.84, but recall for class 1 rises to 0.86.
# Checking performance on the test dataset
y_test_pred_dt = dtree_estimator.predict(X_test)
metrics_score(y_test, y_test_pred_dt)
precision recall f1-score support
0 0.94 0.84 0.89 1431
1 0.55 0.80 0.65 357
accuracy 0.83 1788
macro avg 0.75 0.82 0.77 1788
weighted avg 0.86 0.83 0.84 1788
dtree_tuned_test = model_performance_classification(dtree_estimator, X_test, y_test)
dtree_tuned_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.745099 | 0.8167 | 0.82774 |
Observations:
- The tuned tree generalizes well: test recall for class 1 is 0.80, close to the 0.86 seen on the train set.
- The gain in recall comes at the cost of class-1 precision (0.55 on the test set) and overall accuracy (0.83 vs 0.87 for the untuned tree).
importances = dtree_estimator.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(
importances, index=columns, columns=["Importance"]
).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(13, 5))
sns.barplot(data=importance_df, x=importance_df.Importance, y=importance_df.index)
Observations:
Let's plot the tree and visualize it up to a max_depth of 4.
features = list(X.columns)
plt.figure(figsize=(30, 20), dpi=200)
tree.plot_tree(
dtree_estimator,
max_depth=4,
feature_names=features,
filled=True,
fontsize=12,
class_names=True,
)
plt.show()
Blue leaves represent borrowers who default, i.e. y[1], and orange leaves represent borrowers who repay the HELOC, i.e. y[0]. The more observations a leaf contains, the darker its color.
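The same rules can also be dumped as plain text with scikit-learn's `tree.export_text`, which is convenient for sharing the decision logic with non-technical reviewers. A self-contained sketch on synthetic data (the notebook would pass `dtree_estimator` and `features` instead of the stand-ins here):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the fitted dtree_estimator and its features
X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=1)
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)

# Each indented line is a split condition; leaves show the predicted class
rules = export_text(clf, feature_names=[f"f{i}" for i in range(4)])
print(rules)
```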
Observations:
Random Forest is a bagging algorithm whose base models are decision trees. Bootstrap samples are drawn from the training data, and a separate decision tree is fit on each sample.
The predictions of all the trees are then combined into a final prediction by majority voting (or by averaging predicted probabilities).
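These steps can be sketched by hand: draw bootstrap samples, fit one tree per sample, and combine predictions by majority vote. A hedged illustration on synthetic data (`RandomForestClassifier` does this internally, plus random feature subsetting at each split):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the HELOC training set
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
rng = np.random.default_rng(1)

trees = []
for _ in range(25):
    # Bootstrap: sample rows with replacement, then fit one tree per sample
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=1).fit(X_demo[idx], y_demo[idx]))

# Majority vote across the 25 trees gives the ensemble prediction
votes = np.stack([t.predict(X_demo) for t in trees])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble training accuracy:", (ensemble_pred == y_demo).mean())
```

Because each tree sees a different bootstrap sample, their individual errors partially cancel when votes are aggregated, which is why bagging reduces variance relative to a single tree.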
# Fitting the Random Forest classifier on the training data
rf_estimator = RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(X_train)
metrics_score(y_train, y_pred_train_rf)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observation:
- Like the untuned decision tree, the Random Forest fits the training data perfectly (all metrics 1.00), so it is overfitting; the test set will show how well it generalizes.
# Checking performance on the testing data
y_pred_test_rf = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test_rf)
precision recall f1-score support
0 0.91 0.97 0.94 1431
1 0.84 0.60 0.70 357
accuracy 0.90 1788
macro avg 0.87 0.79 0.82 1788
weighted avg 0.89 0.90 0.89 1788
rf_estimator_test = model_performance_classification(rf_estimator, X_test, y_test)
rf_estimator_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.87465 | 0.785744 | 0.897651 |
Observations:
- The Random Forest generalizes better than the single decision tree, with 0.90 test accuracy and the highest class-1 precision so far (0.84).
- Recall for class 1 is still only 0.60, so many defaulters go undetected.
Let's check the feature importance of the Random Forest
importances = rf_estimator.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(
importances, index=columns, columns=["Importance"]
).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(13, 5))
sns.barplot(data=importance_df, x=importance_df.Importance, y=importance_df.index)
Observations:
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(
class_weight={0: 0.2, 1: 0.8}, random_state=1
)
# Grid of parameters to choose from
params_rf = {
"n_estimators": np.arange(50, 550, 50),
"min_samples_leaf": np.arange(1, 4, 1),
"max_features": [0.7, 0.9, "auto"],
"max_depth": np.arange(2, 7),
"criterion": ["gini", "entropy"],
# "min_samples_leaf": np.arange(5, 55, 5),
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = make_scorer(recall_score, pos_label=1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, max_depth=3,
max_features=0.7, n_estimators=50, random_state=1)
# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(X_train)
metrics_score(y_train, y_pred_train_rf_tuned)
precision recall f1-score support
0 0.95 0.88 0.91 3340
1 0.63 0.79 0.70 832
accuracy 0.87 4172
macro avg 0.79 0.84 0.81 4172
weighted avg 0.88 0.87 0.87 4172
# Checking performance on the test data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(X_test)
metrics_score(y_test, y_pred_test_rf_tuned)
precision recall f1-score support
0 0.93 0.88 0.91 1431
1 0.61 0.74 0.67 357
accuracy 0.86 1788
macro avg 0.77 0.81 0.79 1788
weighted avg 0.87 0.86 0.86 1788
rf_estimator_tuned_test = model_performance_classification(
rf_estimator_tuned, X_test, y_test
)
rf_estimator_tuned_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.773526 | 0.813147 | 0.855705 |
Observations:
- The tuned Random Forest shows a small train-test gap (0.87 vs 0.86 accuracy), so the overfitting has been addressed.
- Recall for class 1 improves to 0.74 on the test set, at the cost of precision (0.61).
# Plotting feature importance
importances = rf_estimator_tuned.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(
importances, index=columns, columns=["Importance"]
).sort_values(by="Importance", ascending=False)
plt.figure(figsize=(13, 5))
sns.barplot(data=importance_df, x=importance_df.Importance, y=importance_df.index)
Observations:
- The most important feature is the debt-to-income ratio (DEBTINC). The model also seems to suggest that the number of delinquent credit lines (DELINQ) is a very important feature as well; it was also identified as the second most important feature by the tuned Decision Tree model.
- Other important features include the number of major derogatory reports (DEROG) and the age of the oldest credit line in months (CLAGE).
# Adaboost Classifier
adaboost_model = AdaBoostClassifier(random_state=1)
# Fitting the model
adaboost_model.fit(X_train, y_train)
# Model Performance on the test data
adaboost_model_perf_test = model_performance_classification(
adaboost_model, X_test, y_test
)
adaboost_model_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.850194 | 0.759139 | 0.883669 |
# Gradient Boost Classifier
gbc = GradientBoostingClassifier(random_state=1)
# Fitting the model
gbc.fit(X_train, y_train)
# Model Performance on the test data
gbc_perf_test = model_performance_classification(gbc, X_test, y_test)
gbc_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.871946 | 0.768585 | 0.892058 |
# XGBoost Classifier
xgb = XGBClassifier(random_state=1, eval_metric="logloss")
# Fitting the model
xgb.fit(X_train, y_train)
# Model Performance on the test data
xgb_perf_test = model_performance_classification(xgb, X_test, y_test)
xgb_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.891455 | 0.82636 | 0.91387 |
# Checking performance on the training data
y_pred_train_xgb = xgb.predict(X_train)
metrics_score(y_train, y_pred_train_xgb)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
# Checking performance on the test data
y_pred_test_xgb = xgb.predict(X_test)
metrics_score(y_test, y_pred_test_xgb)
precision recall f1-score support
0 0.92 0.97 0.95 1431
1 0.86 0.68 0.76 357
accuracy 0.91 1788
macro avg 0.89 0.83 0.85 1788
weighted avg 0.91 0.91 0.91 1788
models_test_comp_df = pd.concat(
[
lg_test.T,
dtree_test.T,
dtree_tuned_test.T,
rf_estimator_test.T,
rf_estimator_tuned_test.T,
adaboost_model_perf_test.T,
gbc_perf_test.T,
xgb_perf_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression classifier",
"Decision Tree classifier",
"Tuned Decision Tree classifier",
"Random Forest classifier",
"Tuned Random Forest classifier",
"Adaboost classifier",
"Gradientboost classifier",
"XGBoost classifier",
]
models_test_comp_df
| | Logistic Regression classifier | Decision Tree classifier | Tuned Decision Tree classifier | Random Forest classifier | Tuned Random Forest classifier | Adaboost classifier | Gradientboost classifier | XGBoost classifier |
|---|---|---|---|---|---|---|---|---|
| Precision | 0.771535 | 0.804214 | 0.745099 | 0.874650 | 0.773526 | 0.850194 | 0.871946 | 0.891455 |
| Recall | 0.625032 | 0.774228 | 0.816700 | 0.785744 | 0.813147 | 0.759139 | 0.768585 | 0.826360 |
| Accuracy | 0.832215 | 0.870805 | 0.827740 | 0.897651 | 0.855705 | 0.883669 | 0.892058 | 0.913870 |
Observation:
- The XGBoost classifier posts the best test-set precision (0.89), recall (0.83), and accuracy (0.91), with the Random Forest and the tuned Decision Tree close behind on individual metrics.